Data-Driven Elucidation of Flavor Chemistry

Flavor molecules are commonly used in the food industry to enhance product quality and consumer experiences but are associated with potential human health risks, highlighting the need for safer alternatives. To address these health-associated challenges and promote reasonable application, several databases for flavor molecules have been constructed. However, no existing studies have comprehensively summarized these data resources according to quality, focused fields, and potential gaps. Here, we systematically summarized 25 flavor molecule databases published within the last 20 years and revealed that data inaccessibility, untimely updates, and nonstandard flavor descriptions are the main limitations of current studies. We examined the development of computational approaches (e.g., machine learning and molecular simulation) for the identification of novel flavor molecules and discussed their major challenges regarding throughput, model interpretability, and the lack of gold-standard data sets for equitable model evaluation. Additionally, we discussed future strategies for the mining and designing of novel flavor molecules based on multi-omics and artificial intelligence to provide a new foundation for flavor science research.


INTRODUCTION
Flavor molecules have a long history of use in food products for enhancing nasal sensations and improving taste perceptions to stimulate the appetites of consumers. 1 Beyond their key roles in defining taste and smell, some flavorings (e.g., vanillin) can increase the shelf life and stability of food products and improve their texture and appearance. 2 In the pharmaceutical industry, flavoring agents are added to mask the unpleasant odor and taste of various drugs, such as cetirizine hydrochloride and famotidine. 3 Despite their recognized importance and wide application in industries, evidence suggests that certain flavor substances pose potential health risks. 4,5 For example, some artificial sweeteners have been associated with colitis, obesity and its related comorbidities, and metabolic dysregulation. 6 Diacetyl, a butter-flavoring compound used in plant bakeries, has been linked to increased rates of bronchiolitis obliterans, while monosodium glutamate has been linked to obesity, metabolic disorders, neurotoxic effects, and reproductive organ damage. 7 Moreover, methyl N-acetyl anthranilate, a common natural berry flavoring, has been shown to cause phototoxicity. 8 In an effort to address these health-associated challenges and promote reasonable applications, several databases, such as the Flavor Ingredient Library developed by the Flavor and Extract Manufacturers Association of the United States, 9 AdditiveChem, 5 and FlavorDB, 10 have been constructed in the past two decades to provide comprehensive and in-depth knowledge on flavor molecules. Although the applications of big data in food science have been summarized in a previous review, 11 no study has systematically evaluated available databases for flavor molecules through the assessment of data quality, focused fields, and potential gaps in information, which limits the further development of this field.
The perception of flavor arises from the interaction of biological machinery (e.g., the taste buds) and flavor molecules; thus, flavor perception can be regarded as an emergent property of a complex biochemical system. 10 The rapid development of computational strategies, such as machine learning (ML) and molecular simulation (MS), provides new opportunities for unveiling the underlying biological mechanisms of flavor perception. Using computational strategies, we can also analyze the structural characteristics of known flavor molecules and explore the interactions between perception receptors and candidate molecules to assist in the discovery of new flavorings with positive health impacts. 12 This review summarizes databases for flavor molecules released within the last two decades and discusses the application of computational strategies for (1) identifying novel flavor molecules, (2) elucidating the molecular interaction of flavor perception, and (3) mining and designing flavor molecules based on multi-omics and artificial intelligence (Figure 1).

FLAVOR MOLECULE DATABASES
Figure 1. Data-driven study in flavor science. Flavor molecules in perfumes, herbs, and foods are responsible for the stimulation of human sensory perceptions. Owing to the increasing number of known flavor molecules, specialized food molecule databases were built based on data management software, such as MySQL and PostgreSQL. These databases enabled the application of computational strategies (e.g., machine learning and molecular simulation) in flavor science and, in conjunction with sensory analysis, have been successfully used to identify novel flavor molecules. With the rapid development of multi-omics and artificial intelligence, advanced computational approaches have shown great potential in guiding the designing of artificial flavorings and the mining of natural flavor molecules.

To provide an overview of known flavor molecules, we retrieved data related to flavor molecules from academic databases such as Scopus, PubMed, Web of Science, and Google Scholar. We retrieved 25 flavor molecule databases, of which 14 included taste molecules, 9 contained aroma molecules, and 2 comprised both (Figure 2 and Table 1). These databases contained information on molecule names, Chemical Abstract Service (CAS) registry numbers, molecular structures in a simplified molecular-input line-entry system (SMILES) format, and flavor descriptions.
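The record fields listed above (name, CAS registry number, SMILES, and flavor descriptions) can be sketched as a small data structure with a basic integrity check. This is a minimal illustration, not the schema of any database reviewed here: the class name `FlavorRecord`, its field names, and the helper `cas_check_digit_ok` are hypothetical. The only domain rule used is the standard CAS check-digit rule, in which the last digit equals the position-weighted sum of the preceding digits modulo 10.

```python
from dataclasses import dataclass, field


def cas_check_digit_ok(cas: str) -> bool:
    """Return True if a CAS registry number has a valid check digit."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    if not (2 <= len(parts[0]) <= 7 and len(parts[1]) == 2 and len(parts[2]) == 1):
        return False
    body = parts[0] + parts[1]
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(parts[2])


@dataclass
class FlavorRecord:
    """Hypothetical record layout mirroring the fields named in the text."""
    name: str
    cas: str
    smiles: str
    descriptors: list[str] = field(default_factory=list)

    def is_valid(self) -> bool:
        # Only checks that fields are present and the CAS number is well formed;
        # it does not validate the SMILES string.
        return bool(self.name) and bool(self.smiles) and cas_check_digit_ok(self.cas)


vanillin = FlavorRecord("vanillin", "121-33-5", "COc1cc(C=O)ccc1O", ["sweet", "vanilla"])
print(vanillin.is_valid())
```

A check of this kind catches transcription errors in CAS numbers when records from several sources are combined, but it says nothing about whether the flavor descriptions themselves are standardized.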
The keyword co-occurrence network of flavor database-related articles was constructed using VOSviewer (Figure 3A). We found that "identification" was the most frequent keyword in these publications, which implies that flavor molecule identification has been a main research focus in this field. The color and size of the circles for "database" and "taste" indicate that there are numerous databases related to "taste", which has recently become a research hotspot. The color and size of the circles for "sweetness" and "bitterness" indicate that they are the most commonly studied taste properties. The significant links found among "identification", "odorant receptor", and "olfactory receptor" suggest that one primary method to identify flavor molecules is based on the interaction between molecules and receptors. The links among "genes", "neurons", "proteins", and "cells" indicate the complex biological mechanism behind flavor perception. The color of the circles and links indicates the time when the corresponding literature was published. We found that the terms "fruit", "food", "dysfunction", and "health" appeared in recent literature, indicating that there is a growing interest in studying natural flavor molecules from food and their health effects (Figure 3A).

Table 1 lists the taste molecule databases, including their focus, Uniform Resource Locator (URL), release date, data availability, and the number of molecules. Taste molecule databases were largely focused on sweetness and bitterness, as they are considered the most common tastes. Four databases specifically focused on sweet-taste molecules, including SWEET-DB, 13 SuperSweet, 14 SweetenersDB, 15 and e-Sweet. 16 SWEET-DB 13 is the first publicly available sweetness database, containing several carbohydrate structures and their mass spectrometry data. 
SuperSweet 14 is the largest sweet-taste database, containing more than 8000 sweet compounds and their calories, physicochemical properties, glycemic index, origin, and other information regarding molecular receptors and targets. In contrast to SWEET-DB, 13 SuperSweet's web server interface offers a user-friendly search and a sweet tree, which groups the sweet substances into three main families (carbohydrates, peptides, and small molecules). 14 e-Sweet 16 combines data from SuperSweet 14 and SweetenersDB 15 to provide a comprehensive data set of 530 sweetener compounds and their relative sweetness values. Both SweetenersDB 15 and e-Sweet 16 have been utilized in machine learning to predict new sweeteners. BitterDB, 18 the first online database of bitterness molecules, contains >550 bitter taste compounds. It also contains information on mutations in receptors influenced by bitter molecules. BitterDB received an update in 2019, 19 which increased the number of bitterness molecules to 1041 and provided additional data on molecules' bitterness intensity, toxicity, and interactive receptors. In 2020, Bayer et al. 20 collected a data set of 247 natural compounds with bitter taste receptor activity, of which 138 were derived from food. 19 22 In addition, several comprehensive databases on molecules with sour, salty, spicy, and fresh tastes have been developed, such as AdditiveChem, 5 PhytoMolecularTasteDB, 23 and ChemTastesDB. 24 PlantMolecularTasteDB 25 contains 1527 phytochemicals from 394 plants and their taste senses (e.g., bitter, sweet, sour, fresh, salty, pungent, and astringent) and anti-inflammatory properties. A unique feature of PlantMolecularTasteDB, absent from other taste-focused databases, is its data on the evidence-based biological activity of the phytotastants. 25 AdditiveChem 5 curated >9064 types of food additives (most of which are flavorings), including information on their molecular structures, physicochemical properties, biosynthesis methods, usage specifications, risk assessment data, and related receptors. PhytoMolecularTasteDB 23 includes plant-derived flavor molecules and details on the combination of tastes resulting from the main flavor molecules found in a medicinal plant. The list includes 431 Ayurvedic medicinal plants, 223 phytochemical classes, and 438 plant-derived molecules. ChemTastesDB 24 contains information on 2944 verified compounds divided into nine classes, comprising the five basic tastes (sweet, bitter, umami, sour, and salty) and four additional categories: tasteless, nonsweet, multi-taste, and miscellaneous. These databases constitute novel tools for the scientific community to expand information on taste molecules and analyze the relationships between molecular structures and flavor properties.

Aroma Molecule Databases.
In aroma molecule databases, molecule olfactory descriptions are typically named after the substance that produces the odor, such as rose fragrance, meat fragrance, and fish fragrance. Flavornet, 26 a compilation of aroma compounds found in the human odor space, was first published in 1998 and last updated in 2004. It contains 738 odorants with their associated CAS registry numbers and 2D structures. These have been classified into 197 categories based on their odor descriptions, such as almond, cabbage, cheese, and herb; however, keywords of molecule odors cannot be used to search this database. The development of SuperScent 27 in 2009 addressed this issue, offering a variety of search options based on chemical names or the molecular structures of odorants. In addition, it contains 2147 volatile compounds classified according to their sources, functions, and odor groups, as well as their chemical properties and commercial information. Unfortunately, it has not been consistently maintained. Odornetwork 28 is another database that is no longer being maintained. Kumar et al. established 526 sensory descriptions and 3016 corresponding flavor molecules from perfume, food, and agricultural and pharmaceutical industries. 28 In 2016, Ueda et al. developed a database of 792 molecules with unpleasant odors, including alcohols, aldehydes, carboxylic acids, esters, ethers, and hydrocarbons using gas chromatography−mass spectrometry. 29 Kumar et al. developed AromaDB in 2018, 30 a database providing 1321 essential oil/aroma compounds from 166 commercially used plants and their bioactivities. Moreover, the database includes additional information regarding the interaction of aroma molecules with proteins/genes. This helped to reveal the action mechanisms of aroma molecules and their potential use in treating diseases. 
The Food Flavor Laboratory Database was developed in 2021, providing information on 171 flavor compounds, including their CAS numbers, chemical structures, aroma thresholds, and descriptions. OlfactionBase 31 contains extensive coverage of 5109 odorants, 2067 olfactory receptors, and 874 OR-odorant pairs. In addition, it contains information on 2871 odorant-binding or pheromone-binding proteins from 190 species.
In addition to academic databases, several commercial databases are available such as the Smart Aroma Database and Volatile Compounds in Food (VCF) online database. The Smart Aroma Database contains information on >500 compounds that contribute to aroma, enabling the objective evaluation and analysis of aroma compounds using gas chromatography−tandem mass spectrometry. The VCF online database contains 9832 volatile substances in food products and their odor descriptions and aroma thresholds. In addition, several databases contain both aroma and taste properties of molecules, including the FEMA Flavor Ingredient Library 9 and FlavorDB. 10 The Flavor Ingredient Library is a database of 3012 flavor substances that includes safety assessments and publications. It provides an indispensable resource for researchers, media, and consumers seeking information on flavor ingredients whose safety has been determined to be generally recognized as safe (GRAS) by the independent FEMA Expert Panel. FlavorDB 10 contains 25,595 flavor molecules, including 2254 natural molecules, 13,869 synthetic molecules, and 9472 molecules of an unknown origin. It divides flavor molecules into 31 categories, containing records of molecular, sensory, absorption, distribution, metabolism, elimination, toxicity properties, literature sources, and flavor characteristics. It may be used to find molecules matching a desired flavor or structure, explore molecules of an ingredient, discover novel food pairings, determine the molecular essence of food ingredients, and associate chemical features with a flavor.

Current Limitations and Future Perspectives of Flavor Molecule Databases.
These data are helpful for researchers studying flavor profiles and the mechanisms of action between taste and olfactory receptors and provide chemists with convenient, high-quality data resources. However, several issues need to be addressed. For example, ∼70% of databases are not downloadable or must be requested from the authors. This limits data reuse and makes assessing data quality and integrity difficult. Furthermore, certain databases such as SWEET-DB 13 and SuperSweet 14 are currently unavailable, and those that are available are not regularly updated after publication, which limits their usefulness for current research. Another consideration is that taste molecule data have been mostly derived from sweet and bitter molecules, with >50% of taste molecule databases focused on bitterants and sweeteners. As a result, other taste sensations have received less attention. This highlights the need to further annotate molecules with sour, salty, and spicy tastes in publications and experimental records. Annotating odor categories of volatile molecules is more challenging than assigning taste categories. Most odor molecule databases divide odors into hundreds of classes based on the substance that produces the smell, which has led to nonstandard odor names. Meanwhile, the flavor thresholds of most flavor molecules and their content in natural resources are yet to be included in any database, which may limit their application in industries. These issues should be considered and addressed in future studies.
To facilitate further data reuse, we collected the known flavor molecules from these databases and subsequently removed redundancies and molecules with ambiguous descriptions (e.g., sweet-like and nonsweet). Finally, 8982 molecules with a known taste and 5046 with a known aroma were obtained, which are provided in a GitHub repository along with this paper.
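The merging step described above can be illustrated with a short sketch. It is not the procedure used to build the released data set: the function name `merge_records`, the `AMBIGUOUS` set (which contains only the two examples named in the text), and the example rows are assumptions, and the molecule key is assumed to be an identifier already standardized upstream (for instance a canonical SMILES string).

```python
# Only the two examples named in the text; a real exclusion list is an assumption.
AMBIGUOUS = {"sweet-like", "nonsweet"}


def merge_records(records):
    """Merge (key, label) pairs from several sources into one label set per molecule.

    Duplicate annotations collapse into a set, and any molecule that carries an
    ambiguous description is dropped entirely. `key` is assumed to be already
    standardized; this function does not canonicalize structures.
    """
    merged = {}
    ambiguous_keys = set()
    for key, label in records:
        label = label.strip().lower()
        if label in AMBIGUOUS:
            ambiguous_keys.add(key)
        else:
            merged.setdefault(key, set()).add(label)
    return {k: v for k, v in merged.items() if k not in ambiguous_keys}


# Hypothetical rows standing in for entries pulled from different databases.
rows = [
    ("OCC(O)CO", "Sweet"),        # glycerol, source A
    ("OCC(O)CO", "sweet"),        # same molecule, source B (redundant)
    ("CC(=O)C(C)=O", "buttery"),  # diacetyl
    ("CCO", "sweet-like"),        # ambiguous description, molecule dropped
]
print(merge_records(rows))
```

Keeping a set of labels per molecule, rather than a single label, preserves molecules that are annotated with more than one taste or aroma across sources.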

SCREENING AND DESIGNING OF FLAVOR MOLECULES BASED ON COMPUTATIONAL STRATEGIES
Comprehensive data on flavor molecules provide a new opportunity for identifying novel flavor molecules based on data-driven computational strategies. The size and color of the circles representing the keywords "taste", "aroma", "machine learning", "molecular dynamics", and "molecular docking" indicate that molecular simulation and machine learning have been widely used in flavor molecule research (Figure 3B). The links among the keywords "identification", "homology modeling", and "receptor" indicate the typical pipeline for identifying novel flavor molecules based on the interaction between receptors and molecules (Figure 3B). Machine learning is usually used for "classification" and "regression" tasks in flavor research (e.g., classifying the taste class of a molecule or predicting its aroma threshold by regression), with algorithms including random forest (RF), support vector machines (SVM), and convolutional neural networks (CNN) (Figure 3B).

Molecular Simulation.
Molecular dynamics and molecular docking are common MS methods that are used in flavor studies. 32,33 Molecular dynamics is a computational simulation of a complex biological system that describes motions, interactions, and dynamics at the atomic level. 34 This is achieved by choosing a "force field" representing all the interatomic interactions and integration of Newtonian equations, which provide the position and speed of atoms over time. 34 It has been increasingly used to explore mechanisms of interaction and conformational relationships between flavor molecules and receptors ( Table 2). Molecular docking is a technique based on the lock-and-key theory. 35 By computing the intermolecular interactions between the flavor molecules and receptors, it predicts their probable binding modes. Common types of intermolecular interactions include van der Waals forces, electrostatic forces, hydrophobic interactions, and chemical bonds. 36 By minimizing these energies, the most stable binding conformation will be identified. 36 The results of molecular dynamics and molecular docking improve our understanding of the flavor properties of molecules and serve as guidelines for downstream experimental analyses 37 ( Figure 4A).
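The core loop of molecular dynamics described above (a force field supplies the forces, and Newton's equations are integrated to obtain positions and velocities over time) can be sketched in one dimension. This is a toy illustration under stated assumptions: a single particle in a harmonic bond potential stands in for a force field, the velocity Verlet scheme is one common integrator chosen here for illustration, and all function names and parameter values are hypothetical and in arbitrary units.

```python
def harmonic_force(x: float, k: float = 1.0, x0: float = 0.0) -> float:
    """Force from a harmonic bond potential U(x) = 0.5 * k * (x - x0)**2."""
    return -k * (x - x0)


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Integrate Newton's equation m * a = F(x) and return the (x, v) trajectory."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy for the harmonic toy system above."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
print(f"energy drift over the run: {abs(e_end - e_start):.2e}")
```

Production simulations of receptor-ligand systems apply the same idea to many thousands of atoms with far richer force fields, which is one reason they depend on high-performance computing.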
In MS of flavor perception, the proteins of interest are the flavor perception receptors distributed on the surfaces of the tongue and nose. In the mammalian taste system, the heterodimer of taste receptor type 1 members 1 and 3 (T1R1/T1R3) functions as an umami taste receptor, taste receptor type 2 members (T2Rs) function as bitter taste receptors, and T1R2/T1R3 functions as a sweet taste receptor. 38 The transient receptor potential channel members, polycystin 1 like 3 (PKD1L3) and PKD2L1, are candidates for sour taste receptors. 39 Salty taste receptors primarily include the epithelial sodium channel, sodium-specific salt taste receptor, nonspecific salt taste receptor, and a taste variant of the vanilloid receptor-1 nonselective cation channel. 40 Unlike taste molecules, aroma molecules do not bind specifically to a single olfactory receptor. Instead, an aroma molecule can bind to several olfactory receptors with varying affinities depending on its physicochemical properties. 41 Upon binding of the odorant, structural changes in the olfactory receptor activate olfactory G proteins. The G proteins activate the enzyme adenylate cyclase, which converts ATP to cyclic AMP (cAMP).
Cyclic nucleotide-gated ion channels in the cells open in response to cAMP, allowing calcium and sodium ions to enter the cell, depolarizing olfactory receptor neurons, and transmitting information to the brain. 42 Recently, MS has been commonly used to study the interactions between receptors and small molecules to identify novel molecules with potential flavor properties (Table 2). Several studies have focused on sweetness perception and the synergistic effects of sweeteners. 43−45 For example, Acevedo et al. developed a comparative model of hT1R2 and hT1R3 subunits to identify their interactions with natural, noncaloric sweeteners, including sweet proteins and glycosylated terpenoids, at the molecular level. 38 Jang et al. conducted MS using predicted structures of the TAS1R2/1R3 heterodimer to analyze the synergistic effects of various blends of natural and artificial sweeteners. 44 To study interactions between receptors and sweeteners, Miao et al. 43 chose eight sweeteners by molecular docking to develop sweetener-T1R2-membrane systems to guide the design of novel and healthy sweeteners. Subsequently, Acevedo et al. characterized the interaction of steviol glycosides with bitter taste receptors (hT2R4 and hT2R14) at the molecular level, leading to a better understanding of the natural sweeteners' off-flavor perception in food products. 46 In addition, MS has been used to screen and design flavor peptides. 47,48 For example, Zhang et al. used molecular dynamics to analyze the interactions between peptides and umami receptors and identified five novel peptides with stronger umami intensity than monosodium glutamate. 47 Using molecular docking, Gao et al. identified several novel umami peptides and found that Phe527 on T1R1/T1R3 was the key binding site, and hydrogen bonding, electrostatic interactions, and hydrophobic interactions were the main binding forces. 48 Moreover, MS has been successfully used to guide the design of odor molecules. 
In olfactory pathways, the odorant binding protein 1 (OBP1) is the main receptor for odor recognition on the malarial vector; thus, it can be used to modulate mosquito behavior and develop new attractants or repellents. 49 Using MS and hierarchical virtual screening, Bomfim et al. successfully identified a modulator for Anopheles gambiae OBP1, indicating the potential application of MS in molecular screening and designing. 49

Machine Learning.
ML is an interdisciplinary subject involving statistics, convex analysis, probability theory, and approximation theory. 50 ML fits mathematical/statistical functions on given data sets and can be subsequently applied to predict the flavor properties of compounds; thus, it is used for high-throughput screening of novel flavor molecules. Current ML-based flavor studies can be divided into two main categories: regression and classification (Figure 4B).
For the regression task, researchers have used various fingerprints of flavor molecules as the input and flavor properties (e.g., sweetness values) as the output in ML models, which can be considered a type of quantitative structure-activity relationship (QSAR) model. 51 In 2002, Barker et al. developed the first QSAR model for sweetness value prediction. The model was developed using multiple linear regression (MLR) and parameters generated from molecular field research on 103 sweeteners and their sweetness levels from the literature. 52 However, molecular field-based descriptors limit the model's application domain to molecules with a similar molecular scaffold. Subsequently, several algorithms and descriptors were used to improve the performance of ML models. 15 52 The establishment of SweetenersDB 15 and BitterDB 19 largely prompted the development of ML-based sweetness and bitterness prediction. Based on data from SweetenersDB, 15 Bouysset et al. developed a new ML model and implemented a freely accessible web server for sweetness prediction. 55 Using this web server, they successfully identified three natural compounds that activated the T1R2/T1R3 expressed in human embryonic kidney cells. Margulis et al. developed ML models based on BitterDB 19 to predict the bitterness of compounds, thereby guiding drug design. 56 Their results suggested that ∼25% of drugs are predicted to be very bitter, with a higher prevalence (∼40%) in COVID-19 drug candidates and microbial natural products. 56 In addition, ML has been successfully used for odor prediction. 41,57 Keller et al. 41 launched an international competition in which several teams built models to predict how a molecule is perceived by humans from its chemical structure. The resulting models accurately predicted odor intensity and pleasantness, in addition to successfully predicting 8 of 19 odor descriptors, including garlic, fish, sweet, fruity, burnt, spices, flower, and sour. 
36 Binary classification is another ML task used in flavor studies; for example, it can be used to determine whether a molecule has a bitter taste. Dagan-Wiener et al. developed the ML classifier BitterPredict 58 to predict the bitterness of compounds based on their molecular structures. Using BitterPredict, they found that 77% of natural products are predicted to be bitter. 58 This tool will help food scientists to identify whether certain ingredients are likely to be bitter and if taste masking is necessary. Predicting compound bitterness is also crucial for improving drug compliance in children, because it allows taste-masking and flavor correction strategies to be adopted. Bai et al. developed an ML model, "Children's Bitter Drug Prediction System", which predicts whether a medicine tastes bitter. 22 Aroma property prediction can also be considered a classification task. Licon et al. developed a method based on a subgroup discovery algorithm to discriminate perceptual qualities of smells based on physicochemical properties. 59 They performed experiments on 74 olfactory qualities and demonstrated that the generation of rules linking chemistry to odor perception was possible, providing a new understanding of the relationship between stimuli and olfactory perception. 59 However, these ML classification models are limited by the availability of negative samples (e.g., non-sweet and non-bitter molecules) owing to the lack of reports in the literature. To address this issue, several studies have proposed different strategies based on known data to predict the sweetness or bitterness of a molecule. 12  implemented ML models to predict three different taste end points, including sweet, bitter, and sour, which achieved an overall accuracy of 90% by 10-fold cross-validation. 62 Chacko et al. 
developed ML models for predicting odor characters using several ML algorithms, such as RF, gradient boosting, and SVM, and 196 two-dimensional RDKit molecular descriptors as the models' inputs. 63 In addition to traditional features, such as physicochemical properties and molecular fingerprints, features extracted from mass spectra have also been used for ML modeling. 64,65 For example, Nozaki et al. designed a novel predictive model which utilized mass spectrometry data with nonlinear dimensionality reduction and natural language processing. 65 ML can also be utilized for the identification of flavor peptides. Jiang et al. developed iBitter-DRLF for the flavor property prediction of peptides based on sequence embedding techniques, soft symmetric alignment, unified representation, and bidirectional long short-term memory. 66 These ML models have achieved strong performance; however, recent studies have shown that a molecule can have multiple tastes or aromas (e.g., taste both "bitter" and "sweet"). 67 Data we collected from these publicly available databases are consistent with these findings, revealing that 5% of collected molecules have multiple tastes, and 78% have multiple aromas. Thus, classifying molecule flavor is better framed as a multilabel classification task (in which a model can generate multiple outputs per molecule) than as a multiclass classification task. Recently, Li et al. designed an ML model to identify odor perception descriptors using multioutput linear regression models, which addresses this issue. 68 Several screening pipelines combining ML and MS to identify novel flavor molecules have been developed to achieve more accurate prediction. For example, Goel et al. designed a framework comprising QSAR models and molecular docking for identifying possible sweeteners from natural molecules. 69 Xiu et al. 
developed an in silico pipeline to identify novel umami-tasting molecules in batches from SWEET-DB 13 and BitterDB 19 databases via principal component analysis, QSAR modeling, molecular docking, and electronic tongue analysis. 70 Using this pipeline, they identified 18 novel umami molecules, which were confirmed by electronic tongue analysis. 70
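The difference between the multiclass and multilabel framings discussed above can be made concrete with a small sketch. The class list, the annotations, and the predictions below are hypothetical, and Hamming loss is used here only as one common multilabel metric; none of this reproduces a model from the cited studies.

```python
def to_multilabel(annotations, classes):
    """Encode each molecule's set of labels as a binary indicator vector."""
    return [[1 if c in labels else 0 for c in classes] for labels in annotations]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / (len(y_true) * len(y_true[0]))


classes = ["sweet", "bitter", "sour"]
# Hypothetical annotations; the second molecule is both sweet and bitter.
annotations = [{"sweet"}, {"sweet", "bitter"}, {"sour"}]
y_true = to_multilabel(annotations, classes)

# A multiclass model emits exactly one label per molecule, so it cannot
# represent the second molecule's two tastes.
y_pred_multiclass = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
print(y_true)
print(hamming_loss(y_true, y_pred_multiclass))
```

In the indicator encoding each column is an independent yes/no target, so any model that can produce several outputs per molecule can be trained on it directly.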

Limitations and Future Perspectives of Computational Strategies.
Numerous studies have demonstrated the advantages of MS and ML for flavor molecule studies, but limitations remain. For example, most studies that predict novel flavor molecules lack sufficient experimental validation (e.g., e-nose, e-tongue, and sensory validation), which reduces the reliability of their predictions. Furthermore, some previous models are not open-source; therefore, readers cannot replicate the algorithm and verify its accuracy. Meanwhile, most of these prediction models do not provide an online application programming interface. Therefore, flavor chemists without specialized knowledge of computational techniques may find these tools difficult to use.
Both MS and ML have notable limitations. MS relies heavily on high-performance computing resources, which limits its speed and throughput. To accelerate the screening process, Gentile et al. developed Deep Docking, a deep learning-assisted molecular docking software that utilizes QSAR models to approximate the docking outcome for unprocessed entries, thereby removing unfavorable molecules and accelerating the screening process. 71 Thus, it may be better suited to large-scale screening of potential flavor molecules. Notably, the screening of active ingredients for targeted receptors using the MS approach relies on high-quality protein structures to achieve accurate prediction and analysis. Although ∼200,000 protein structures have been solved, the high-resolution structures of some flavor-related receptors are still unavailable. 72 However, the rapid development of protein structure prediction algorithms, such as RoseTTAFold 73 and AlphaFold, 74 and of cryo-electron microscopy means that the accessibility of protein structures may no longer be a limiting factor in the future.
ML-based approaches have much higher throughput than MS; however, they have two major limitations: (1) low interpretability and (2) the need for large-scale training data. Despite the reputation of ML as an "uninterpretable black box", it is still essential to understand how the model makes a prediction. Given this, algorithms such as SHapley Additive exPlanations (SHAP) 75 and Sure Independence Screening and Sparsifying Operator (SISSO) 76 have been proposed to "whiten" the black box by quantifying the contribution of features to the model's predictions. SHAP explains model outputs using the classic Shapley values from game theory and their related extensions, while SISSO combines symbolic regression and compressed sensing to identify the most important features that describe the target property or function. 75,76 Guo et al. have successfully used SHAP to analyze which descriptors have a close relationship with the astringency threshold. 77 Moreover, recently developed interpretable molecular ML methods, such as an iteratively focused graph network, 78 attempt to rank the contribution of each atom in a compound based on the model's attention weights to increase the interpretability of predictions. ML relies on large-scale training data to achieve high performance. However, only a tiny fraction of known flavor molecules has been included in public data sets; most are scattered among numerous literature reports and have not been systematically curated. 68,79 The lack of high-quality data sets can lead to studies being conducted using different training and testing data sets, making it difficult for readers to compare the performance of models. Thus, there is an urgent need to develop advanced text-mining algorithms to systematically extract flavor molecules and their properties from publications. 
In turn, this will help to create a comprehensive gold-standard data set to evaluate the performance of emerging ML algorithms for flavor property prediction in future studies.
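To make the feature-attribution idea concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every coalition of features. It is a minimal illustration of the game-theoretic definition rather than the SHAP library itself, which relies on faster approximations. The toy model, the three descriptors, and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(x):
    """Hypothetical model: descriptors 0 and 1 interact, descriptor 2 acts additively."""
    return 2.0 * x[0] * x[1] + 0.5 * x[2]

instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)  # the interaction term is shared equally: approximately [2.0, 2.0, 2.0]
```

The contributions sum to the difference between the prediction for the instance (6.0) and the prediction for the baseline (0.0), which is the property that makes Shapley-based explanations additive.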

FUTURE STRATEGIES FOR IDENTIFYING FLAVOR MOLECULES
High-throughput screening based on molecular simulation and ML enables the identification of molecules with potential flavor properties from large-scale databases, such as COCONUT 80 and Super Natural. 81 However, the coverage of known molecules is still limited, with only 6% of the potential natural products evaluated. 82 The rapid growth of genomic data has revealed that the biosynthetic capacity of plants is vastly underappreciated, with millions of potential natural products awaiting discovery. 83 Emerging computational strategies such as multi-omics and artificial intelligence provide new opportunities for mining undiscovered natural flavor molecules from food and designing purpose-built, safer artificial flavorings (Figure 5).

Mining Natural Flavor Molecules Based on Multi-omics.
In plants, genes involved in specialized metabolic pathways are encoded contiguously on the chromosome in biosynthetic gene clusters (BGCs), which facilitates the elucidation of biosynthetic pathways and, in turn, the identification of natural flavor molecules 83 (Figure 5A). Several software tools have been developed to identify BGCs across genome sequences, including antiSMASH, 84 PRISM, 85 and DeepBGC. 86 antiSMASH 84 was first released in 2011 and updated six times over 10 years. The software identifies regions at the gene cluster level based on profile hidden Markov models (pHMMs) and aligns them to their [...] 85 DeepBGC is a deep learning strategy for detecting BGCs that also employs an RF classifier to predict the products of detected BGCs, offering an improved ability to identify new BGC classes. 86 These tools have been widely used for elucidating novel natural products and their molecular structures from bacterial and fungal genomes. However, most known flavorings are derived from plants. 87 To better fit the needs of plant BGC identification, Kautsar et al. developed plantiSMASH, 88 an analysis platform for the identification of candidate plant BGCs. They applied plantiSMASH to 48 high-quality plant genomes and identified a rich diversity of candidate plant BGCs, which prompted the identification of new phytochemicals. 88 The predictive ability of genome-based natural product annotation can be further enhanced by combining it with other omics data. For example, fragmentation patterns observed in tandem mass spectrometry (MS/MS) spectra can assist in discovering metabolites and their biosynthetic genes. Software such as NPLinker 89 could be used to link BGCs and mass spectrometry data, thereby predicting novel natural products produced by plants. ML models could then predict the flavor class and intensity of these newly identified molecules to identify potential natural flavorings with better flavor properties (Figure 5A).
Figure 5. (A) Based on plant genome and metabolome data, novel natural products are annotated using software such as plantiSMASH and NPLinker. Machine learning models could subsequently be used to predict the flavor characteristics of these natural products to discover novel natural flavor molecules. (B) Design of artificial flavor molecules based on molecular generation. By identifying molecular representations (e.g., string-based and molecular graphs) and functions that map a set of properties to a group of molecular structures, generative models could be used to rapidly identify diverse sets of molecules highly optimized for flavor characteristics. Note: SMILES, simplified molecular-input line-entry system.
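Because BGC genes sit next to each other on the chromosome, candidate clusters can be found by looking for runs of biosynthetic genes. The sketch below illustrates only this grouping step and is not how antiSMASH, plantiSMASH, or DeepBGC are implemented. It assumes that each gene has already been flagged as biosynthetic or not (in practice by pHMM domain searches), and the gene identifiers and the `max_gap` and `min_hits` thresholds are hypothetical.

```python
def find_candidate_clusters(genes, max_gap=2, min_hits=3):
    """Group biosynthetic genes that lie close together on a chromosome.

    genes: list of (gene_id, is_biosynthetic) tuples in chromosomal order.
    max_gap: number of consecutive non-biosynthetic genes tolerated inside a cluster.
    min_hits: minimum number of biosynthetic genes needed to report a cluster.
    """
    clusters, current, gap = [], [], 0
    for gene_id, is_hit in genes:
        if is_hit:
            current.append(gene_id)
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:
                # The run has ended; keep it only if it is large enough.
                if len(current) >= min_hits:
                    clusters.append(current)
                current, gap = [], 0
    if len(current) >= min_hits:
        clusters.append(current)
    return clusters

# Hypothetical stretch of a chromosome with eleven genes.
genes = [
    ("g1", False), ("g2", True), ("g3", True), ("g4", False),
    ("g5", True), ("g6", False), ("g7", False), ("g8", False),
    ("g9", True), ("g10", False), ("g11", True),
]
clusters = find_candidate_clusters(genes)
print(clusters)  # [['g2', 'g3', 'g5']]
```

The second run of hits (g9 and g11) is discarded because it contains fewer than three biosynthetic genes.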

Design of Artificial Alternatives Based on Artificial Intelligence.
The emerging application of artificial intelligence in cheminformatics, especially molecular generation, is another promising strategy for the design of artificial flavor molecules (Figure 5B). The potential health risks of existing artificial sweeteners have encouraged scientists to design safer alternatives. De novo molecular design has recently been adopted in drug discovery and provides a reproducible methodology that can be transferred to artificial flavoring design. Because generative models can propose molecules with desired flavor properties directly, they compare favorably with design that relies on human expertise alone. By identifying a function that maps a set of properties to a group of structures, generative models can rapidly identify diverse sets of molecules highly optimized for specific applications. 90 The successful application of molecular generation largely depends on the input representation and the type of model architecture. To generate novel molecules with specific flavor properties, known molecules are first converted into string-based representations or molecular graphs for model training. 91 Combined with deep neural networks, these representations can capture highly complex correlations between chemical structures and their flavor properties.
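As an illustration of a string-based representation, the sketch below splits SMILES strings into tokens and maps them to integer indices, which is the form a neural network would consume. The regular expression is a simplified assumption and does not cover the full SMILES grammar. The vanillin SMILES is used only as an example input.

```python
import re

# Simplified token pattern (an assumption, not the full SMILES grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, single
# letters, ring-closure digits, and bond or branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/()@.~:*$]")

def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting characters the pattern does not cover."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

def build_vocab(token_lists):
    """Assign an integer index to every distinct token, in sorted order."""
    distinct = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(distinct)}

def encode(tokens, vocab):
    """Convert a token list into the integer sequence used for model training."""
    return [vocab[tok] for tok in tokens]

vanillin = "COc1cc(C=O)ccc1O"
vanillin_tokens = tokenize(vanillin)
vocab = build_vocab([vanillin_tokens, tokenize("ClCCBr")])
print(vanillin_tokens)
print(encode(vanillin_tokens, vocab))
```

Treating `Cl` and `Br` as single tokens matters because a character-level split would otherwise read `Cl` as a carbon followed by an unknown symbol.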
To date, molecular generative models have been used successfully for drug discovery. For example, Zhavoronkov et al. used a generative tensorial reinforcement learning model to identify potent inhibitors of discoidin domain receptor 1 in 21 days, illustrating the potential of generative models for the rapid design of molecules that are synthetically feasible and possess potentially innovative properties. 92 Skinnider et al. developed DarkNPS, which uses a generative model to determine a statistical probability distribution over unobserved structures of psychoactive substances, in turn identifying potential new psychoactive substances. 93 Based on 1753 known psychoactive substances, they generated 8.9 million unique molecules with addictive potential. The documented successes of these practical applications encourage the use of de novo molecular generation for identifying novel flavor molecules. By identifying molecular representations and functions that map a set of physicochemical properties to a group of molecular structures, generative models could rapidly predict diverse sets of molecules with highly optimized flavor characteristics (Figure 5B).
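The notion of learning a probability distribution over structures and then sampling from it can be illustrated with a deliberately tiny model. The sketch below fits a character-level bigram model to a few SMILES strings and samples new strings from it. It is a toy stand-in for the deep generative models discussed above and not a reimplementation of any of them. The three training molecules are an arbitrary illustrative choice, and the sampled strings are not guaranteed to be valid SMILES.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens that cannot collide with single characters

def fit_bigram(smiles_list):
    """Count how often each character follows another across the training strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for smiles in smiles_list:
        chars = [START] + list(smiles) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def sample(counts, rng, max_len=40):
    """Draw one string by repeatedly sampling the next character from the learned counts."""
    out, current = [], START
    while len(out) < max_len:
        candidates = sorted(counts[current])
        weights = [counts[current][c] for c in candidates]
        current = rng.choices(candidates, weights=weights)[0]
        if current == END:
            break
        out.append(current)
    return "".join(out)

# Tiny illustrative training set: vanillin, diacetyl, and ethanol.
training = ["COc1cc(C=O)ccc1O", "CC(=O)C(=O)C", "CCO"]
model = fit_bigram(training)
rng = random.Random(0)
generated = [sample(model, rng) for _ in range(5)]
print(generated)
```

In a realistic workflow, the sampled strings would be checked for chemical validity and novelty and then scored with a flavor-property model before any candidate is considered further.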

DISCUSSION AND PERSPECTIVE
In this paper, we summarized 25 databases containing >14,000 unique flavor molecules (8982 molecules with known taste and 5046 with known aroma). We found that 5% of collected molecules have multiple tastes and 78% have multiple aromas, indicating the complexity of flavor perception. Although these databases have encouraged research in the field of flavor science, data in ∼70% of these databases could not be downloaded or were only available upon request from the authors. This makes it difficult for users to assess data quality and integrity and limits data reuse. Current studies are also biased (>50%) toward bitter and sweet molecules, so other taste sensations have received less attention. This highlights the need to further annotate molecules with tastes such as sour, salty, and spicy in publications and experimental records. Furthermore, the content of most flavor molecules in natural resources is not reported in any database, which may limit their application in industry.
Based on these data, molecular simulation and machine learning have been widely used to identify novel flavor molecules. Multiple types of data (e.g., molecular structures of flavor molecules and features extracted from mass spectra) and algorithms (e.g., RF, SVM, and CNN) have been used for ML modeling. As an in silico methodology, these models help prioritize large numbers of compounds in terms of their desired flavor properties, in turn significantly reducing the number of candidate chemicals for detailed sensory analyses. The feasibility and efficiency of ML modeling are widely accepted; however, issues with untimely updates, data inaccessibility, and code nondisclosure remain. Therefore, we strongly encourage authors to make all data and code openly accessible during the publication process in future studies. Finally, we discussed the knowledge gaps associated with the poor coverage of known molecules and highlighted future computational strategies for identifying novel flavor molecules. By harnessing the power of artificial intelligence and the wealth of multi-omics data, we will be able to uncover novel flavor compounds and gain a deeper understanding of the intricate interplay between molecules that shape our perceptions of taste and aroma. This could pave the way for the creation of innovative food products with rich flavor profiles and enhanced nutritional value. In future work, we will propose an impartial evaluation system for flavor molecule databases according to their data quality, availability, and transparency to advance findable, accessible, interoperable, and reusable research.

ASSOCIATED CONTENT
Data Availability Statement.
To facilitate further usage, we provide flavor molecule data collected from publicly available databases in a GitHub repository: https://github.com/DachuanZhang-FutureFood/flavor-science.